LanguageRefer: Spatial-Language Model for 3D Visual Grounding

Junha Roh, Karthik Desingh, Ali Farhadi, Dieter Fox

Conference on Robot Learning (CoRL 2021) [arxiv][pdf][bibtex][code]

Abstract

To realize robots that can understand human instructions and perform meaningful tasks in the near future, it is important to develop learned models that can understand referential language to identify common objects in real-world 3D scenes. In this paper, we develop a spatial-language model for a 3D visual grounding problem. Specifically, given a reconstructed 3D scene in the form of a point cloud with 3D bounding boxes of potential object candidates, and a language utterance referring to a target object in the scene, our model identifies the target object from a set of potential candidates. Our spatial-language model uses a transformer-based architecture that combines spatial embeddings from bounding boxes with a fine-tuned language embedding from DistilBert and reasons among the objects in the 3D scene to find the target object. We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D. We provide additional analysis of performance in spatial reasoning tasks decoupled from perception noise, the effect of view-dependent utterances in terms of accuracy, and view-point annotations for potential robotics applications.

Bibtex

@inproceedings{Roh2021Language,
title={{L}anguage{R}efer: Spatial-Language Model for 3D Visual Grounding},
author={Junha Roh and Karthik Desingh and Ali Farhadi and Dieter Fox},
booktitle={Proceedings of the Conference on Robot Learning},
year={2021},
}

3D Visual Grounding Task: ReferIt3D

ReferIt3D proposed a 3D visual grounding task built on ScanNet, a dataset of reconstructed 3D indoor scenes, by adding referring-language annotations. Following the official splits of ScanNet, ReferIt3D provides Nr3D, natural language annotations referring to objects from 76 target classes, and Sr3D, template-based referring language about spatial relationships. Based on these datasets, it defines the 3D visual grounding task: given a scene with ground-truth bounding-boxes, the corresponding pointclouds, and a language description of the target object, the goal is to choose the target bounding-box from the candidates.
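
To make the task setup concrete, a single grounding sample could be represented as in the sketch below; the field names are illustrative assumptions and do not correspond to the official ReferIt3D code.

```python
# A minimal sketch of one grounding sample (assumed field names, not the official
# ReferIt3D API): the model must predict `target_index` from the other fields.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class GroundingSample:
    scene_id: str                      # e.g. "scene0011_00" (a ScanNet scene)
    boxes: np.ndarray                  # (N, 6) ground-truth boxes: center xyz + size
    point_clouds: List[np.ndarray]     # N per-object point clouds cropped from the scene
    utterance: str                     # referring expression for one target object
    target_index: int                  # index into `boxes` that the model should choose
```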

Examples of ReferIt3D

We visualize some example scenes (scene0011_00, scene0231_00, scene0141_00) with bounding-boxes, utterances, and corresponding target bounding-boxes in Figures 1-3. Figures 1A, 2A, and 3A show all bounding-boxes (red) in the example scenes. Bounding-boxes are highlighted in blue when they belong to the selected target class for each scene; for instance, the two bounding-boxes of tables are blue since we chose table as the target class for visualization. The remaining figures (Figures 1B, 1C, 2B, 3B) show the individual bounding-boxes of objects (yellow) in the target classes. We also add virtual robot paths for the application of robot navigation: red dots indicate random starting positions and yellow dots trace paths to reach the target object. Each caption contains an example utterance for the corresponding target object.

Figure 1A. Two bounding-boxes of tables are shown in blue while the bounding-boxes of other classes are shown in red.

Figure 1B (a). A yellow bounding-box shows the ground-truth bounding-box with the utterance "this is the large conference table with many chairs."

Figure 1C (a). A virtual robot path from a random position in the scene to the target object from Figure 1B (a) is shown in yellow dots. A red dot indicates the starting position.

Figure 1B (b). A yellow bounding-box shows the ground-truth bounding-box with the utterance "the desk directly below the board on the wall."

Figure 1C (b). A virtual robot path from a random position in the scene to the target object from Figure 1B (b) is shown in yellow dots. A red dot indicates the starting position.

Figure 2A. Four bounding-boxes of kitchen cabinets are shown in blue and the other bounding-boxes are shown in red.

Figure 2B (a). A yellow bounding-box shows one of the bounding-boxes of kitchen cabinets with the utterance "Kitchen cabinet to the left of the stove." A virtual robot path from a random position in the scene to the target object is shown in yellow dots. A red dot indicates the starting position.

Figure 2B (b). A yellow bounding-box shows one of the bounding-boxes of kitchen cabinets with the utterance "This upper cabinet is between the stove and sinks." A virtual robot path from a random position in the scene to the target object is shown in yellow dots. A red dot indicates the starting position.

Figure 2B (c). A yellow bounding-box shows one of the bounding-boxes of kitchen cabinets with the utterance "The cabinets under the sink." A virtual robot path from a random position in the scene to the target object is shown in yellow dots. A red dot indicates the starting position.

Figure 2B (d). A yellow bounding-box shows one of the bounding-boxes of kitchen cabinets with the utterance "Of the two bottom cabinets, choose the one on the right." A virtual robot path from a random position in the scene to the target object is shown in yellow dots. A red dot indicates the starting position.

Figure 3A. Three bounding-boxes of chairs are shown in blue and the other bounding-boxes are shown in red.

Figure 3B (a). A yellow bounding-box shows one of the bounding-boxes of chairs with the utterance "Chair closest to the door." A virtual robot path from a random position in the scene to the target object is shown in yellow dots. A red dot indicates the starting position.

Figure 3B (b). A yellow bounding-box shows one of the bounding-boxes of chairs with the utterance "select the chair pushed in at the desk." A virtual robot path from a random position in the scene to the target object is shown in yellow dots. A red dot indicates the starting position.

Figure 3B (c). A yellow bounding-box shows one of the bounding-boxes of chairs with the utterance "The chair is the red one closest to the window facing towards the blue ball and away from the desk." A virtual robot path from a random position in the scene to the target object is shown in yellow dots. A red dot indicates the starting position.

LanguageRefer

We propose LanguageRefer, a spatial-language model for the 3D visual grounding task. Our model focuses on understanding spatial relationships between objects from the language description and 3D bounding-box information, for two reasons: 1) referring utterances depend heavily on descriptions of spatial relationships, and 2) understanding spatial relationships is holistic in nature, unlike unary attributes such as color and shape.

Figure 4 illustrates the four modules of the framework: classification, spatial embedding, language embedding, and spatial-language model. Like many previous approaches, we follow a two-stage approach. Classification: given the ground-truth bounding-boxes from the input, we train PointNet++ to predict semantic class labels of the objects represented by these bounding-boxes; this constitutes the classification module. Language embedding: the predicted class labels are concatenated to the input utterance, joined by the special token [SEP], then tokenized and transformed into token embedding vectors by the DistilBert tokenizer and embedding layer. Spatial embedding: bounding-box information (position and size) is transformed by positional encoding and added to the language embedding. Spatial-language model: finally, transformer layers produce a score for each object, and we choose the object with the maximum score.
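
The sketch below illustrates this pipeline at a high level, assuming the Hugging Face DistilBert implementation; the per-object scoring head, the single-token class labels, and the way spatial encodings are injected are simplifications on our part, not the released implementation.

```python
# A simplified sketch of the LanguageRefer forward pass (not the released code).
import torch
import torch.nn as nn
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")
score_head = nn.Linear(768, 1)  # assumed scoring head: one score per object feature

def predict_target(utterance, predicted_labels, box_encodings):
    """predicted_labels: one class name per candidate object (from PointNet++).
    box_encodings: (num_objects, 768) positional-encoded box position and size."""
    # The utterance and the predicted class labels are joined by [SEP] tokens.
    text = utterance + " [SEP] " + " [SEP] ".join(predicted_labels)
    tokens = tokenizer(text, return_tensors="pt")
    token_emb = encoder.get_input_embeddings()(tokens["input_ids"])   # (1, T, 768)

    # Add the spatial encoding at the token right after each object's [SEP]
    # (we assume one token per class label for brevity).
    sep_positions = (tokens["input_ids"][0] == tokenizer.sep_token_id).nonzero().squeeze(-1)
    obj_positions = sep_positions[: len(predicted_labels)] + 1
    spatial = torch.zeros_like(token_emb)
    spatial[0, obj_positions] = box_encodings

    hidden = encoder(inputs_embeds=token_emb + spatial,
                     attention_mask=tokens["attention_mask"]).last_hidden_state
    scores = score_head(hidden[0, obj_positions]).squeeze(-1)         # one score per object
    return int(scores.argmax())                                       # predicted target index
```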

Figure 5 shows the detailed process of the framework. We train the model with multiple loss terms; for further detail, please refer to the paper. The reference classifier performs the selection for the task defined in ReferIt3D, and the other loss terms are auxiliary.
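
As a rough sketch of how these loss terms could be combined, the snippet below mixes the reference, instance classification, and masking tasks described in Figure 5; the weights and exact formulation are placeholders, not the values used in the paper.

```python
# Hedged sketch of a combined training objective (placeholder weights).
import torch
import torch.nn.functional as F

def total_loss(ref_logits, target_index, inst_logits, inst_labels,
               mlm_logits, mlm_labels, w_inst=1.0, w_mlm=1.0):
    """ref_logits: (N,) scores over candidate objects; target_index: int.
    inst_logits/inst_labels: (N,) per-object target-class membership.
    mlm_logits: (T, V) predictions for masked tokens; mlm_labels: (T,) with -100
    at unmasked positions."""
    # Main reference task: cross-entropy over the candidate objects.
    ref = F.cross_entropy(ref_logits.unsqueeze(0), torch.tensor([target_index]))
    # Auxiliary: does each object belong to the target class?
    inst = F.binary_cross_entropy_with_logits(inst_logits, inst_labels.float())
    # Auxiliary: recover randomly replaced utterance tokens (standard MLM loss).
    mlm = F.cross_entropy(mlm_logits, mlm_labels, ignore_index=-100)
    return ref + w_inst * inst + w_mlm * mlm
```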

Figure 6 shows the inference procedure of the framework. At inference, we employ a target classifier that predicts the target class from the language utterance, which we use to filter out objects unrelated to the target class in the reference task.

Figure 4. Simplified overview of LanguageRefer. LanguageRefer is a model that takes a 3D pointcloud of a scene, the bounding-boxes of objects in the scene, and a language description grounding a single object in the scene, and identifies the described object. It has four modules: classification, spatial embedding, language embedding, and spatial-language model.

Figure 5. Detailed overview of LanguageRefer. A semantic classifier predicts class labels from the 3D pointcloud in each bounding box (using color and xyz positions). The language description or utterance (e.g. "Facing the foot of the bed, the bed on the right.") is transformed into a sequence of tokens along with a sequence of predicted class labels. The input token embedding in DistilBert converts the tokens into embedded feature vectors (green squares). Bounding-box position and size information is positional-encoded into encoded vectors (orange squares), which are added to the corresponding embedded feature vectors (green squares). After the addition, the modified features are processed by our reference model based on DistilBert and fed to multiple tasks. The main task is the reference task, which is to choose the referred object from the object features. The instance classification task is a binary classification of whether a given object feature belongs to the target class. Lastly, the masking task is to recover the original token from a randomly replaced token in the utterance, as is commonly done in language modeling.

Figure 6. Detailed overview of LanguageRefer at the inference stage. At inference, we follow the approach of InstanceRefer to filter out objects that do not belong to the predicted target class. A target classifier takes the language utterance as input and predicts the target class. Filtering masks are generated by comparing the predicted target class to the class labels predicted by the semantic classifier. To reduce the chance of removing the true target instance in the filtering process, the top-k class predictions (from the semantic classifier) for each object are compared to the predicted target class (not shown in the figure). The filtering masks are applied to the output embeddings of the spatial-language model so that only objects related to the predicted target class remain.
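
Below is a minimal sketch of this filtering step under the description above; the function and tensor names are illustrative, not taken from the released code.

```python
# Hedged sketch of inference-time filtering: keep an object only if the predicted
# target class appears among its top-k semantic-class predictions.
import torch

def filter_and_select(ref_scores, topk_class_ids, target_class_id):
    """ref_scores: (N,) reference scores from the spatial-language model.
    topk_class_ids: (N, k) top-k class predictions per object from the classifier.
    target_class_id: class id predicted by the target classifier from the utterance."""
    keep = (topk_class_ids == target_class_id).any(dim=1)        # (N,) boolean mask
    masked = ref_scores.masked_fill(~keep, float("-inf"))        # drop unrelated objects
    # If filtering removes every candidate, fall back to the unfiltered scores.
    scores = ref_scores if not keep.any() else masked
    return int(scores.argmax())
```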

Qualitative Results of LanguageRefer

Figures 7A, 7B, and 7C show qualitative prediction results of LanguageRefer on the first example scene (scene0011_00) with natural language utterances. In Figures 7A and 7B, the model correctly chooses the target object given utterances from the test dataset as well as custom utterances such as "a smaller table." Figure 7C shows a failure case in which the custom utterance "table without any chairs around" is given but the model selects another table at the center of the room. Expressions such as "without" seem to be rare; our model correctly predicts all referred tables from the corresponding utterances in the dataset:

"this is the large conference table with many chairs", "the desk directly below the board on the wall", "the biggest table in room", "the large table in the middle of the room", "the thin wooden table underneath the television and immediately to the left of the trash can", "smaller table against the wall", "select the long table in the middle of the room", "choose the table that is up against the wall", "the long conference table in the middle of the room", "choose the table that sits against the wall", "the largest table in the room", "select the table underneath the tv", "a small table below the television on the wall", "the very big dining table in the center of the room."

Figure 7A. LanguageRefer successfully selects the referred table at the center with a custom utterance "table with lots of chairs" as well as the utterance from the dataset "this is the large conference table with many chairs."

Figure 7B. LanguageRefer chooses the correct table on the side with a custom utterance "a smaller table" as well as the utterance from the dataset "the desk directly below the board on the wall."

Figure 7C. LanguageRefer chooses the incorrect table at the center (in red) when the desired target object is the smaller table (in yellow) with a custom utterance "table without any chairs around."

In Figures 8A, 8B, 8C, and 8D, we asked the model to choose one of three stacked boxes with natural language utterances. In Figure 8A, the model failed to select the top box (in yellow) and selected the box in the middle (in red). When we replaced the class labels predicted by PointNet++ with the ground-truth class labels, the model was able to choose the correct box on top (Figure 8B). However, attempts to select the box in the middle failed with or without ground-truth class labels (Figure 8C). Figure 8D shows a successful reference to the bottom box by the model with predicted class labels. Here the robot paths are visualized with the color of the prediction: green if correct, red otherwise.

We examined the predicted class labels of the three boxes: microwave, box, box (from top to bottom). The incorrect label caused the reference task for the top box to fail: 0/7. After replacing the incorrect class label with the ground-truth class label, the accuracy of the reference task for the top box improved to 6/7. However, the ground-truth class label did not improve the accuracy of the reference task for the box in the middle: 0/6. The reference task for the bottom box was 6/6 even before fixing the label. By providing ground-truth labels, we were able to disentangle reference errors from perception errors. From the three-box example, we found that the model was not able to refer to the box in the middle of the vertical stack.

Figure 8A. LanguageRefer failed to select the top box from the utterance "of the three boxes stacked pick the top one." A yellow bounding-box shows the ground-truth target object and a red bounding-box shows the incorrectly selected target object.

Figure 8B. When we provide the ground-truth class labels to the model, it chooses the correct box from the same utterance as in Figure 8A.

Figure 8C. LanguageRefer chooses the incorrect box at the bottom (in red) with the utterance "middle box in the stack," while the ground-truth box is in the middle (in yellow).

Figure 8D. LanguageRefer chooses the correct box at the bottom (in green) with the utterances "the bottom box in the stack of three boxes" and "the large white box on the bottom."

In Figures 9A, 9B, and 9C, we evaluated the accuracy of the reference task for kitchen cabinets with natural language utterances. Figures 9A and 9B show successful references; Figure 9C shows a failure case. Our model correctly selected 8 out of 10 instances.

Note that the top-5 class predictions of the kitchen cabinets are noisy:
['cabinet', 'cabinets', 'kitchen cabinets', 'bathroom cabinet', 'kitchen cabinet']
['cabinet', 'bathroom cabinet', 'cabinets', 'kitchen cabinet', 'kitchen cabinets']
['kitchen cabinets', 'cabinet', 'cabinets', 'kitchen cabinet', 'bathroom cabinet'].

This shows that our model is robust to subtle changes in class labels. Even without unifying similar class names, including plural forms, our model was able to accurately refer to the correct objects. We do not preprocess utterances beyond tokenization; no preprocessing of language expressions or transformation of the utterance into a fixed form is used.

Figure 9A. LanguageRefer successfully selects the referred kitchen cabinets with the utterance "a double sink is on top of this section of low cabinets."

Figure 9B. LanguageRefer chooses the correct kitchen cabinets with the utterance "higher section of kitchen cabinets."

Figure 9C. LanguageRefer chooses the incorrect kitchen cabinet on the left side (in red) when the desired target cabinets are in the middle (in yellow) with the utterance "lower cabinets next to stove."

Qualitative Comparison to ReferIt3D

We compared the proposed method to ReferIt3D; Figures 10A-G show examples of predictions (on scene0699_00) with the corresponding utterances in the captions. Figure 10A highlights some objects in the scene: a bed at the bottom of the image, a walk-in closet at the top, and a desk to its right. Blue bounding-boxes highlight bags and yellow bounding-boxes show backpacks. We examine utterances that select one of the bags in the bedroom.

Figure 10B and 10C show results from LanguageRefer and ReferIt3D respectively with the utterance "The light brown bag on the floor closest to the bed." The proposed method correctly selected the ground-truth bag (in green, in Figure 10B) while ReferIt3D chose an incorrect bag (in red, in Figure 10C).

Figure 10D and 10E show results with the utterance "It is the bag against the wall, not in the closet." While our approach successfully chose the intended bag, ReferIt3D chose the nearby backpack instead of a bag. This kind of confusion with objects of other classes happened often with ReferIt3D.

Figure 10F and 10G show results with the utterance "the bag in the closet next to the cardboard box." Both methods failed to choose the bag in the closet; they chose the same objects as with the utterance "It is the bag against the wall, not in the closet." In fact, for all utterances referring to the bag in the closet, neither method ever chose the intended bag. This also happened when we provided ground-truth class labels to our method, so it was not caused by perception failure.

Figure 10A. Target objects (bag) in blue and non-target objects (backpack) in yellow.

Figure 10B. LanguageRefer chooses the correct bag (in green) with the utterance "The light brown bag on the floor closest to the bed."

Figure 10C. ReferIt3D chooses an incorrect bag (in red) with the utterance "The light brown bag on the floor closest to the bed." The ground-truth bag is in yellow.

Figure 10D. LanguageRefer chooses the correct bag (in green) with the utterance "It is the bag against the wall, not in the closet."

Figure 10E. ReferIt3D chooses an incorrect object, backpack (in red), instead of the ground-truth bag (in yellow) with the utterance "It is the bag against the wall, not in the closet."

Figure 10F. LanguageRefer fails to choose the correct bag (in yellow) but chooses an incorrect bag (in red) with the utterance "the bag in the closet next to the cardboard box."

Figure 10G. ReferIt3D chooses an incorrect backpack (in red) again, instead of the ground-truth bag (in yellow) with the utterance "the bag in the closet next to the cardboard box."

Orientation Annotation for View-Dependent Utterances

In addition to the proposed model, we collected orientations for view-dependent utterances. View-dependent utterances without information about the original viewpoint make the reference task challenging. For instance, an utterance such as "The door is wood with the handle on the left side." assumes a specific orientation of the agent, and it is impossible to recover the true orientation without knowing the referred object, unlike view-dependent utterances with explicit viewpoint information such as "Facing the foot of the bed." However, the original ReferIt3D dataset does not distinguish utterances without orientation information from those with it. Therefore, we split the view-dependent (VD) utterance category into two subcategories, VD-explicit and VD-implicit, where VD-explicit utterances contain explicit viewpoint information. We then collected, from human annotators, the orientations that make each utterance valid.

We set four standard orientations, assuming the agent is in the room (around the center of the scene), and ask the annotators to select all orientations that can be considered valid for the utterance. Figure 11 shows examples of the four orientations. Under the assumption that the agent is inside the room, we found that four orientations are sufficient to recover the original viewpoints of the speakers. In total, 12,680 view-dependent utterances of the Nr3D dataset were annotated; of those, 5,942 utterances are classified as VD-explicit. For the train and test splits, 10,206 and 2,474 utterances were annotated, respectively.

(a) An example of the standard orientation 1

(b) An example of the standard orientation 2

(c) An example of the standard orientation 3

(d) An example of the standard orientation 4

Figure 11. Examples of standard orientations for view-point annotation (a-d). We assume that the robot is always inside the room except for the cases specified by utterances.
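
As an illustration only: if we assume the four standard orientations correspond to 90-degree yaw rotations about the vertical axis around the scene center (an assumption on our part; the annotation itself only records which orientations make an utterance valid), applying an orientation label to a scene could look like the following.

```python
# Hedged sketch: rotate scene points into an assumed canonical orientation.
import numpy as np

def rotate_to_orientation(points, orientation_id, center):
    """points: (N, 3) scene or box points; orientation_id in {1, 2, 3, 4};
    center: (3,) scene center. Assumes 90-degree yaw steps about the z-axis."""
    yaw = (orientation_id - 1) * np.pi / 2.0
    c, s = np.cos(yaw), np.sin(yaw)
    rot_z = np.array([[c, -s, 0.0],
                      [s,  c, 0.0],
                      [0.0, 0.0, 1.0]])
    return (points - center) @ rot_z.T + center
```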

Orientation Annotation Webpage

We also provide the orientation annotation webpage (link above). We recommend using a computer or laptop to view the page. Please do not click the view-annotation checkboxes (the page is still live, so clicks can affect the actual annotation data). The first page shows links to all the scenes for annotation. Clicking one scene shows the list of utterances with some flags. Clicking a single utterance renders all the bounding-boxes with the target bounding-boxes highlighted: a green bounding-box is the ground-truth target and the red bounding-boxes are distractors that belong to the target class but are not the target object instance. Pressing any number in {1, 2, 3, 4} on the keyboard shows the scene from the corresponding canonical orientation. You can zoom in/out, translate, and rotate with your mouse. We collect annotations only on utterances with correct guesses from human annotators that mention the target class.

Note that, in the ReferIt3D annotation process, the ground-truth target class and the distinguished bounding-boxes of the target class were provided to the annotators; we provide the same information to our annotators. However, in the actual ReferIt3D task, the model does not have access to the target class, and many bounding-boxes from other classes are given, as shown in Figure 1A. If an utterance assumes a shared view or orientation between speakers, the ambiguity can be easily resolved by human listeners since they have this extra information and can manipulate the orientation as well. The same assumption, however, can make the reference task even more challenging for the model, which needs to verify hypothetical orientations under class uncertainty among multiple candidates.

Figure 12. Interface of the orientation annotation. On the left side, a 3D visualization of the scene with overlaid bounding-boxes and class labels is provided. On the right side, a table of utterance information with orientation-annotation checkboxes is shown. Clicking a row in the table changes the highlighted bounding-boxes in the 3D visualization to match the clicked utterance. A green bounding-box is the true target object, while red bounding-boxes are distractor objects of the same class. The flags 'Correct Guess' and 'Mentions Target Class' indicate whether utterances are considered valid (according to the official ReferIt3D evaluation). The flags 'View-Dependency' and 'Use Language' provide information about the utterances. 'Sp', 'Cl', and 'Sh' indicate whether the utterance uses spatial relationships, color, and shape, respectively. For instance, the utterance "The black paper towel dispenser above the white dispenser and to the right of 2 white sinks." uses color ("black" and "white") and spatial relationships ("above", "to the right of").

Ablation of Positional Encoding and DistilBert

To examine the effect of the spatial encoding (the sinusoidal positional encoding function) and the base language model (DistilBert), we trained ablation models.

In the first ablation model, we replaced the sinusoidal positional encoding function with a linear layer that transforms the 6-dimensional bounding-box input vector into a 768-dimensional embedding vector. In the second ablation model, we kept the sinusoidal positional encoding but replaced DistilBert with BERT.
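
For reference, the two spatial-encoding variants compared here could be sketched as follows. The dimensions follow the 6-dimensional box input and 768-dimensional token embeddings mentioned above; the exact frequency schedule of the sinusoidal encoding is our assumption based on the standard formulation, not the released code.

```python
# Hedged sketch of the two spatial-encoding variants in this ablation.
import torch
import torch.nn as nn

def sinusoidal_box_encoding(boxes, dim=768):
    """boxes: (N, 6) box center xyz and size; returns (N, dim)."""
    d_per_coord = dim // boxes.size(1)                     # 128 dims per coordinate
    half = torch.arange(0, d_per_coord, 2).float()         # 64 frequencies per coordinate
    freqs = torch.exp(-half * torch.log(torch.tensor(10000.0)) / d_per_coord)
    angles = boxes.unsqueeze(-1) * freqs                   # (N, 6, 64)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)   # (N, 6, 128)
    return enc.reshape(boxes.size(0), -1)                  # (N, 768)

# The first ablation replaces the function above with a learned linear projection.
linear_box_encoding = nn.Linear(6, 768)
```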

The first ablation model, with the linear layer in place of the positional encoding, achieved 37.8% accuracy on Nr3D, while our final model achieved 43.9%. This shows that choosing an effective spatial encoding scheme, such as the sinusoidal encoding, is important.

The second ablation model, with BERT as the base language model, achieved 45.3% accuracy on Nr3D, slightly higher than the DistilBert-based model (43.9%). However, when trained on Sr3D it achieved only 49.3%, while the original model achieved 56.0%. We observed instability when training with the BERT model; on the other hand, the successfully trained BERT model on Nr3D converged faster than the DistilBert-based model. During training on Sr3D, the total loss surged in the middle and did not recover. Given these observations, the second ablation with the BERT model is inconclusive. We chose the DistilBert-based model for our framework because it is lightweight and easy to train. Our goal is to develop a modular approach that can be easily modified as learned embeddings advance, especially in NLP and computer vision.